Highly Efficient Compensation-based Parallelism for Wavefront Loops on GPUs
نویسندگان
چکیده
Wavefront loops are widely used in many scientific applications, e.g., partial differential equation (PDE) solvers and sequence alignment tools. However, due to the data dependencies in wavefront loops, it is challenging to fully utilize the abundant compute units of GPUs and to reuse data through their memory hierarchy. Existing solutions can only optimize for these factors to a limited extent. For example, tiling-based methods optimize memory access but may result in load imbalance; while compensation-based methods, which change the original order of computation to expose more parallelism and then compensate for it, suffer from both global synchronization overhead and limited generality. In this paper, we first prove under which circumstances that breaking data dependencies and properly changing the sequence of computation operators in our compensation-based method does not affect the correctness of results. Based on this analysis, we design a highly efficient compensation-based parallelism on GPUs. Our method provides weighted scanbased GPU kernels to optimize the computation and combines with the tiling method to optimize memory access and synchronization. The performance results on the NVIDIA K80 and P100 GPU platforms demonstrate that our method can achieve significant improvements for four types of real-world application kernels over the state-of-the-art research.
منابع مشابه
Model-Driven Tile Size Selection for DOACROSS Loops on GPUs
DOALL loops are tiled to exploit DOALL parallelism and data locality on GPUs. In contrast, due to loop-carried dependences, DOACROSS loops must be skewed first in order to make tiling legal and exploit wavefront parallelism across the tiles and within a tile. Thus, tile size selection, which is performance-critical, becomes more complex for DOACROSS loops than DOALL loops on GPUs. This paper pr...
متن کاملMapping dynamic programming algorithms on graphics processing units
The Graphics Processing Unit (GPU) is a highly parallel, many-core streaming architecture that can execute hundreds of threads concurrently. The data parallel architecture of the GPU is suitable to perform computation intensive applications. In recent years, the use of GPUs for general purpose computation has increased and a large set of problems can be tackled by mapping onto GPUs. The program...
متن کاملWavefront Scheduling : Path Based Data Representation andScheduling
The IA-64 architecture is rich with features that enable aggressive exploitation of instruction-level parallelism. Features such as speculation, predication, multiway branches and others provide compilers with new opportunities for the extraction of parallelism in programs. Code scheduling is a central component in any compiler for the IA-64 architecture. This paper describes the implementation...
متن کاملCompilation of Modelica Array Computations into Single Assignment C for Efficient Execution on CUDA-enabled GPUs
Mathematical models, derived for example from discretisation of partial differential equations, often contain operations over large arrays. In this work we investigate the possibility of compiling array operations from models in the equation-based language Modelica into Single Assignment C (SAC). The SAC2C SAC compiler can generate highly efficient code that, for instance, can be executed on CU...
متن کاملEfficient irregular wavefront propagation algorithms on hybrid CPU-GPU machines
We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elemen...
متن کامل